Skip to content

tickets/SP-3008: Bug fixes and improved tools for comparison of completed visits and pre-night sims#196

Open
ehneilsen wants to merge 19 commits intomainfrom
tickets/SP-3008
Open

tickets/SP-3008: Bug fixes and improved tools for comparison of completed visits and pre-night sims#196
ehneilsen wants to merge 19 commits intomainfrom
tickets/SP-3008

Conversation

@ehneilsen
Copy link
Collaborator

This started off as a bug fix for comparenight being too picky about how it was matching simulated visits to visits in the prenight simulation, failing to match visits when the pointing was not in the same healpixel. This change replaces the simple healpix-based strategy with coordinate matching taking advantage of astropy's catalog matching tools.

While debugging, I also added several improved tools for showing tables and figures that compare simulated with actual visits. I think these greatly improve on what was in comparenight before. There will be a corresponding PR on schedview_notebooks (which will depend on this PR in schedview) shortly.

Copy link
Member

@rhiannonlynne rhiannonlynne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have comments, because some of the docstrings could be clearer and some were actually completely opaque as they are.

One thing in particular I had not taken into account was that you were putting together all of the different simulation visits, plus all of the completed visits, into one big data frame. This is not easy to tell from the information that is available in this module.

I suspect that it may be starting to make sense to make a class to handle "all of the data acquisition from a night", including fetching the given set of simulation visits, the completed visits, and then concatenating them in the way that currently only exists in the notebook. You later end up depending on having the data in this particular format, but it's not an obvious format and not created anywhere within this code base -- which to me, means that maybe this should be shifted to a class, with methods to fetch and manipulate the data, so that you can maintain the particular data format you want.

from .multisim import match_visits_across_sims


def assign_field_hpids(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs a different name, to reduce confusion about what it does. "assign_field_hpids" doesn't imply matching, but one of the requirements for this function is two different data frames to match (and this is what drives the hpid assignment, if I'm getting it right).

Perhaps it could simply be "match_pointings_hpid" ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe "assign_matching_field_hpids"?
I don't know .. just assign_field_hpids is accurate but makes me think you could have done what you probably did first (just assign each pointing to the nearest hpid).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about I modify this to just assign each pointing to the nearest hpid, and then have a separate function, snap_to_referece_hpid_when_close, that changes assigned hpids to one those from a reference set when the coordinates are close, and then have the notebook call both in succession?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the time you get rid of the matching, there's really nothing left but the call to hp.ang2pix, so maybe we should just drop the assign_field_hpids function entierly and have the notebook call hp.ang2pix directly, followed by snap_to_reference_hpid_when_close.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're maybe just (at the end of the day) proposing the new name here then?

would then "assign_matching_reference_hpids" or something work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly, with the difference that the initial setting of the id is by a direct call to hp.ang2pix in the notebook, and a separate call to snap to the reference. This does result in more being done in the notebook, but has the advantage that in the notebook it makes it clear what's going on. I'm happy either with your proposed renaming, or with pulling hp.ang2pix out.

simulated_visits : `pd.DataFrame`
Simulated visit table; must contain columns identified by ``ra_col``
and ``dec_col`` (both in degrees). An integer column named
``hpid_col`` will be added or overwritten.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the visit pointings that are being used as the basis for the match, correct? Might be useful to make that comment here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I see (from the notebook) that it's actually the simulated visits from all of the simulations, not a single sim.
Are there failure modes in matching, if different simulations had slightly different field centres?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If different simulations have different field centers, they will match if they have they are close enough to correspond to the same healpix, and not otherwise.
It's not clear to me whether this is actually a failure, though.
What we're trying to do here is to match hpids when a completed visit corresponds to the visit in a simulation.
If different simulations have different field centers, then the completed visit could have come from a scheduler request that matches at most one of them, and the simulation that it is nearest to seems like the best guess (but not guaranteed).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative would be to ignore dithering altogether: instead of matching to field centers in simulation, do the catalog match to the original undithered WFD tessellation, and match both coordinates in completed and all the simulations to these points. This would mean, however, that pointings could be a large fraction of a camera diameter off and still be considered a "match."

and ``dec_col`` (both in degrees). An integer column named
``hpid_col`` will be added or overwritten.

completed_visits : `pd.DataFrame`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the completed visits hpid be overwritten, when compared against a new simulation?

Copy link
Collaborator Author

@ehneilsen ehneilsen Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is set already (and inplace=True), yes. I'll note that in the docstring.

Completed visit table; same column requirements as
``simulated_visits``.

field_hpix_nside : `int`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reasonable useful default for this?
I imagine that your testing has shown that some values tend to work pretty well, while others are overkill and others are way too coarse. I'm guessing that either 32 or at the outside 64 are best choices?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been using 2*18=262144, or about 0.8" resolution, but highter would be just as good. My goal here was to make sure that it is high enough that simulated pointings that are different should have different hpids, and I wasn't concerned with it being too high because matching sims that match each other should have exactly matching coordinates.


field_coord_tolerance_deg : `float`
Maximum angular separation (deg) for a completed visit to be considered
a match to a simulated healpix center.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, is there a recommendation for this? I imagine it's on the order of the small offsets you've been noticing, which ought to hold up as long as the tessellation we're using matches the tessellation on the summit, for that night.
I imagine that the recommendation is something like
"On a typical night, where the field tessellation on the summit matches the tessellation in the simulation (for wide-area surveys), the tolerance is likely around 0.001 (??) degrees. For dithering around field centres and for mismatches in tessellation, the tolerance might be set larger, potentially up to 1.7 degrees." ( ??)

within_tol = sim_hp_sep.deg < coord_match_tolerance_deg
completed_visits.loc[within_tol, hpid_col] = sim_hpids[sim_hp_match_idx[within_tol]]

return simulated_visits, completed_visits
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why we couldn't pass in multiple simulated_visits data frames at the same time, but I think it's because you're matching the completed visits to those simulated visits, and then the completed_visits can end up with different healpix ids depending on which simulation it's been matched against.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I think you noticed elsewhere, I pass visits from all simulations in the same DataFrame.

The simulation index to compare against the observation (index 0).
visits : `pd.DataFrame`
Table of visits that contains at least the columns ``sim_index`` and
``start_timestamp``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand from this docstring what "visits" is .. it seems like it's not just the information from either the completed visits or a single simulation, but potentially both? How are they combined?
Is there some function somewhere you'd use to create the "visits" data frame and could that be referenced here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, visits is the DataFrame with all visits, both from completed and simulated simulations.

Parameters
----------
sim_index : `int`
The simulation index to compare against the observation (index 0).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is sim_index? Is it part of how the "visits" data frame referenced next was created?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an id that indentifies sequences of visits. That is, the visits DataFrame contains all visits, including the simulations and completed visits. The sim_index column of the visits dataframe identifies which sim (or completed) that specific visit came from.

dec=completed_visits[decl_col].values * u.deg,
frame="icrs",
)
sim_hp_match_idx, sim_hp_sep, _ = completed_coords.match_to_catalog_sky(sim_hp_coords)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you're using a really high side by default for assigning the sim visits to healpix ids -- could there be a potential failure where:
sim 1 has field centres which are slightly different than the field centres in sim 2
then the completed visits would only match either the healpix in sim 1 or sim 2, but if you'd used a lower nside they would both have actually matched and both been within the tolerance?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment on the corresponding comment on the docstring. My intention was to distinguish dithers, so completed visits in one dither not match simulated visits in a different dither, but rather only accommodate differences between the coordinates requested by the FBS and the coordinates reported in ConsDB.

@rhiannonlynne
Copy link
Member

To be honest, I'm still a little confused about the matching, after seeing how you're actually fetching and using the data frame.
With current on-sky visits from the FBS, we don't re-tesselate within the night, but that isn't guaranteed to be true in general (if, for example, we had same-filter pairs, we might want to dither the tessellation within the night to avoid chip gap overlaps), and while we might make that re-tesselation repeatable, I could imagine it being something that is also random or depends on the time of night, etc. So that different simulations might end up with field centres that are offset from each other, which would (at a high side) result in them having different healpix IDs.
But then the real visits that are "close enough" to the centre of the field pointing .. would they match up with only one of the sims (the closest) or would they match all of them (assuming in all cases that the offset in the field centre is less than the distance tolerance defined)?

@ehneilsen
Copy link
Collaborator Author

I suspect that it may be starting to make sense to make a class to handle "all of the data acquisition from a night", including fetching the given set of simulation visits, the completed visits, and then concatenating them in the way that currently only exists in the notebook. You later end up depending on having the data in this particular format, but it's not an obvious format and not created anywhere within this code base -- which to me, means that maybe this should be shifted to a class, with methods to fetch and manipulate the data, so that you can maintain the particular data format you want.

I dislike the idea of putting this into a class because it interferes with code reuse. Most of what I want to do with the visits table involves pandas.DataFrame operations or passing visits as an argument to functions that take a pandas.DataFrame (e.g. bokeh functions).

Instead, I propose defining and documenting a python NewType:

from typing import NewType

VisitsDF = NewType('VisitsDF', pd.DataFrame)
# Document columns and dtypes here

This will make VisitsDF look like a subclass of pd.DataFrame from the point of view of a type checker, so if you pass a plain pd.DataFrame to a functions with a type hint of VisitsDF the type checker will complain, but it will be fine if you pass a VisitsDF as argument v to a function with a type hint of pd.DataFrame for argument v.

From the point of view of the runtime python interpreter, it VisitsDF will just be a synonym for pd.DataFrame.

So, using NewType will document what is going on and help pre-commit checks catch issues, but won't actually interfere with using it (unlike creating a class).

Alternately, I am aware of pandera, a module for enforcing specific schema for data frames. I have never tried it, but would be open to it.

@ehneilsen
Copy link
Collaborator Author

I will move the construction of the DataFrames from the notebook into functions in schedview that return instances of VisitsDF (or whatever I call it): I agree that moving as much code as possible out of the notebooks and into schedview proper where it can be type checked, unit tested, etc. is a good idea. I just don't want the object I get back from these functions to be hard to deal with with standard tools the way a custom class would be.

@rhiannonlynne
Copy link
Member

I don't really like the creation of the NewType, because if you really want it to be a data frame for use in other places, then it should just be a data frame .. making it a specific type seems likely to lead to (at some point) relying on a specific property (either data value or behaviour) of the NewType .. and then it's not just a data frame anymore anyway.

Ok - leave it as a pandas data frame, but indicate where the simulated_visits ought to be coming from (from querying for all of the simulated visits in the night) and then .. does it make sense to just combine the simulated and completed visits when returning from the assign_hpid_field_centers function?
Or otherwise have another function which does the concat in the way you want it to (with the addition of the extra metadata), indicate that for the computing of offsets, you should use the output of that new function?

@ehneilsen
Copy link
Collaborator Author

ehneilsen commented Mar 25, 2026

I don't really like the creation of the NewType, because if you really want it to be a data frame for use in other places, then it should just be a data frame .. making it a specific type seems likely to lead to (at some point) relying on a specific property (either data value or behaviour) of the NewType .. and then it's not just a data frame anymore anyway.

The reason why I proposed using NewType instead of creating a subclass was to reduce the temptation to add or change actual functionality in the subclass. Another option would be to use a TypeAlias, in which case the python interpreter wouldn't even be able to tell the difference:

>>> from typing import TypeAlias
... 
... class Foo():
...     pass
... 
... Bar: TypeAlias = Foo
... 
... bar = Bar()
... foo = Foo()
... 
... foo.__class__ == bar.__class__, isinstance(foo, Bar), isinstance(bar, Foo)
... 
(True, True, True)
>>> foo.__class__.__name__
'Foo'
>>> bar.__class__.__name__
'Foo'

Or, we can leave it as a DataFrame. I'm happy with any of these.

So yes, the best answer may be just for me to add better docstrings to the functions that take the DataFrame as arguments describing exactly what columns of what types it uses, and what they should mean, and referencing functions that can be used to produce conformant DataFrames.

I am in the process of writing such functions now (moving more code from the notebooks to schedview).

@rhiannonlynne
Copy link
Member

I moved some of the discussion about matching to slack here since maybe this is a question that would benefit from other viewpoints.
I can see your point, but I also think that I would have said any visit that came some relatively large fraction of the FOV to our requested visit was actually a "match" to a given simulation.
Also, I verified that the offsets between requested ra/dec and acquired ra/dec range up to 4" or even 4.5" for some nights ... maybe a "safe" value would be a minimum of 5" for a precise target/boresight matchup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants